# WhiteWineQuality by HsiAnHuang

Introduction

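This exploratory data analysis (EDA) project, written in R, examines how the chemical composition of white wines relates to their quality ratings. To set the scene, here are some well-known white grape varieties and how they are usually described:

  1. Chardonnay: mild tropical fruit aromas.
  2. Sauvignon Blanc: a dry, fresh, and crisp palate.
  3. Riesling: generally very high acidity.
  4. Gewürztraminer: sweet styles carry lychee flavors, while dry styles have floral aromas.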

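  • The common thread among mainstream European white wines is acidity. There may be a translation issue here: "acid" and "sour" are both rendered as the same word (酸) in Chinese, so "acid" does not necessarily mean a sour taste on the tongue.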

Dataset variables (the first three):

  1. fixed acidity
  2. volatile acidity (undesirable when too high, as it gives an acetic, vinegar-like taste)
  3. citric acid (can add a fresh note)

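Based on these written descriptions of white wine, my initial prediction is that these three variables, which relate to sourness and sweetness, should be important features.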

For the study we follow the project template, which covers the following analyses along with discussion and reflection:

  • Univariate
  • Bivariate
  • Multivariate

Univariate Plots Section

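We use glimpse from the dplyr package together with base R's summary to explore the dataset: its dimensions, the number of variables, their types, and summary statistics.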
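As a minimal sketch of this first look, the snippet below rebuilds the first three rows shown in the output that follows, writes them to a temporary CSV, and inspects the result. The temporary file stands in for the real data file, whose name is not given here, and base R's str is used as a stand-in for dplyr::glimpse so that the example needs no extra packages.

```r
# First three rows of the dataset, copied from the glimpse output
# (the V1 row-index column is left out).
csv_lines <- c(
  paste("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
        "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density",
        "pH", "sulphates", "alcohol", "quality", sep = ","),
  "7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6",
  "6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6",
  "8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6"
)

# Hypothetical stand-in for the real CSV file
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

wine_head <- read.csv(path)

str(wine_head)      # dimensions and column types, similar to glimpse()
summary(wine_head)  # min, quartiles, mean, max per column
```

On the full dataset the same two calls would be applied to the wine data frame.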

## Observations: 4,898
## Variables: 13
## $ V1                   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "...
## $ fixed.acidity        <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6...
## $ volatile.acidity     <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32,...
## $ citric.acid          <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16,...
## $ residual.sugar       <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00...
## $ chlorides            <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050,...
## $ free.sulfur.dioxide  <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 1...
## $ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132...
## $ density              <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0...
## $ pH                   <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18,...
## $ sulphates            <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47,...
## $ alcohol              <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8,...
## $ quality              <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7,...
##       V1            fixed.acidity    volatile.acidity  citric.acid    
##  Length:4898        Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  Class :character   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Mode  :character   Median : 6.800   Median :0.2600   Median :0.3200  
##                     Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##                     3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##                     Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Univariate Analysis

Structure of dataset

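For the target variable quality, the median is 6 and the mean is 5.878, so most ratings cluster around 6; the lowest rating is 3 and the highest is 9. The pH values range from 2.72 to 3.82 (median 3.18), which is indeed in the acidic range.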

Let us look at the distributions of all the variables.

b1 <- ggplot(aes(x = fixed.acidity), data = wine) + geom_freqpoly()
b2 <- ggplot(aes(x = volatile.acidity), data = wine) + geom_freqpoly()
b3 <- ggplot(aes(x = citric.acid), data = wine) + geom_freqpoly()
b4 <- ggplot(aes(x = residual.sugar), data = wine) + geom_freqpoly()
b5 <- ggplot(aes(x = chlorides), data = wine) + geom_freqpoly()
b6 <- ggplot(aes(x = free.sulfur.dioxide), data = wine) + geom_freqpoly()
b7 <- ggplot(aes(x = total.sulfur.dioxide), data = wine) + geom_freqpoly()
b8 <- ggplot(aes(x = density), data = wine) + geom_freqpoly()
b9 <- ggplot(aes(x = pH), data = wine) + geom_freqpoly()
b10 <- ggplot(aes(x = sulphates), data = wine) + geom_freqpoly()
b11 <- ggplot(aes(x = alcohol), data = wine) + geom_freqpoly()
#Display all the plots in one chart
grid.arrange(b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

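Most variables look roughly like a normal distribution; only alcohol is widely spread out. From the previous analysis we know that most ratings are 6, and since most variables are roughly normal, a reasonable working assumption is that wines rated 6 sit mostly near the center of each variable's distribution.

### Next we log-transform residual sugar and alcohol and look at the distributions.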

ps<-qplot(x = residual.sugar, data = wine,
      geom = 'freqpoly') +
  scale_x_log10()
pal<-qplot(x = alcohol, data = wine,
      geom = 'freqpoly') +
  scale_x_log10()
grid.arrange(ps,pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

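Even after the log transform the data do not move toward a normal distribution: residual sugar shows two peaks, and alcohol looks close to uniform. Whether quality really rises with alcohol content is therefore open to question. Given the two peaks in residual sugar, I assume it is a variable with a Gaussian mixture distribution.

### What is the structure of your dataset?

There are 4,898 white wine samples with 11 chemical variables (features), all numeric, plus the quality rating. Statistical observations: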

  1. No values in the data are NaN.
  2. A quality rating of 6 is the most common.
  3. Residual sugar looks like a Gaussian mixture.
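
The first two observations can be checked with a few lines of base R. The sketch below runs on a small made-up data frame (the values are illustrative, not taken from the dataset); on the real data the same calls would be applied to wine.

```r
# Illustrative toy data, NOT the real wine data
toy <- data.frame(
  quality = c(6L, 6L, 5L, 7L, 6L, 5L),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.2, 9.6)
)

# Number of missing values per column
n_missing <- colSums(is.na(toy))

# Frequency of each rating and the most common one
quality_counts <- table(toy$quality)
most_common <- names(quality_counts)[which.max(quality_counts)]

n_missing
quality_counts
most_common
```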

What is/are the main feature(s) of interest in your dataset?

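Residual sugar, although its correlation with the other features still needs to be worked out. Alcohol varies over a wide range, and common sense suggests it is a major factor that can play an important role in quality.

### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?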

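Each feature plays a different role on a wine taster's tongue. Humans, after all, do not have a very powerful sense of taste, and it is hard to detect differences at very small concentrations.

cor(wine[,2:12], wine$quality)

Under this premise, I expect variables with a narrow spread to contribute little.

  1. Among the acids, fixed acidity may be the most important of the three.
  2. Alcohol, which varies widely, may be a key factor.
  3. pH is a logarithmic scale, so the gap between 3 and 3.3 is large.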
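
To make the pH point concrete: pH is the negative base-10 logarithm of the hydrogen ion concentration, so a difference of 0.3 pH units corresponds to roughly a factor of two in concentration. A quick check in R, where h_conc is a helper introduced only for this illustration:

```r
# Hydrogen ion concentration (mol/L) implied by a pH value
h_conc <- function(pH) 10^(-pH)

# How much more acidic is pH 3.0 than pH 3.3?
ratio_small <- h_conc(3.0) / h_conc(3.3)

# Same comparison across the observed pH range of the dataset (2.72 to 3.82)
ratio_range <- h_conc(2.72) / h_conc(3.82)

round(c(ratio_small, ratio_range), 2)
```

The first ratio is about 2 and the second about 12.6, so seemingly small pH differences hide large differences in acidity.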

Did you create any new variables from existing variables in the dataset?

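No. One could subtract free sulfur dioxide from total sulfur dioxide to get a new variable, but I did not see much chemical or physical meaning in it. For the other variables I do not understand the chemical relationships between them well enough, so I did not create new variables.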
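
For reference, deriving such a variable would be a one-liner. The sketch below uses the first four rows from the glimpse output; the column name bound.sulfur.dioxide is a hypothetical choice, not something in the dataset.

```r
# First four rows of the two sulfur dioxide columns, from the glimpse output
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Hypothetical derived column: the part of total SO2 that is not free
wine_head$bound.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide

wine_head
```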

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

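residual.sugar looks like a skewed normal distribution. After a log transform the overall shape is somewhat more bell-like, but two mixed normal components appear.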
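
The effect of a log transform on a right-skewed variable can be sketched with simulated data. The sample below is a log-normal stand-in, not the real residual.sugar column, and skewness is a small helper defined here for illustration; being unimodal, the simulated sample does not reproduce the two peaks seen in the wine data.

```r
# Sample skewness: third central moment divided by variance^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Simulated right-skewed variable (assumption: log-normal shape)
sugar_like <- rlnorm(1000, meanlog = 1.5, sdlog = 0.9)

skew_raw <- skewness(sugar_like)         # clearly positive (long right tail)
skew_log <- skewness(log10(sugar_like))  # close to zero after the transform

c(raw = skew_raw, log10 = skew_log)
```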

Bivariate Plots Section

The three variables most positively correlated with quality are:

  1. alcohol 0.435574715
  2. pH 0.099427246
  3. sulphates 0.053677877

The three variables most negatively correlated with quality are:

  1. density -0.307123313
  2. chlorides -0.209934411
  3. volatile.acidity -0.194722969

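The correlation plot also reveals some noteworthy pairs:

  • density is strongly positively correlated with residual sugar and total.sulfur.dioxide
  • free.sulfur.dioxide is strongly positively correlated with total.sulfur.dioxide
  • citric.acid is weakly positively correlated with fixed.acidity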
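
A ranking like the one above can be produced by correlating every feature with quality and sorting the result. The sketch below uses a tiny made-up data frame (illustrative values only, not the wine data) to show the mechanics.

```r
# Illustrative toy data, NOT the real wine data
toy <- data.frame(
  alcohol = c(9, 10, 11, 12, 13),
  density = c(0.999, 0.997, 0.996, 0.993, 0.991),
  pH      = c(3.1, 3.3, 3.0, 3.2, 3.25),
  quality = c(5, 5, 6, 7, 7)
)

# Correlation of every feature column with quality, as a named vector
features <- setdiff(names(toy), "quality")
r <- cor(toy[, features], toy$quality)[, 1]

# Most positive first, most negative last
ranked <- sort(r, decreasing = TRUE)
ranked
```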

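# Treat quality as a factor so each rating gets its own box on the x axis
wine$quality.factor <- factor(wine$quality)

# Jittered points plus a boxplot of `variable` for each quality level;
# the red star marks the median of each level
scatter_plot <- function(variable) {
  ggplot(aes_string(x = "quality.factor", y = variable), data = wine) +
    geom_point(alpha = 0.3, position = "jitter") +
    geom_boxplot(alpha = 0.5) + 
    stat_summary(fun.y = "median",
               geom = "point",
               color = "red",
               shape =  8,
               size = 4) 
}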

scatter_plot("alcohol") + 
  geom_smooth(aes(quality-2, alcohol), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1)

cor(wine$quality, wine$alcohol)
## [1] 0.4355747
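
A note on the quality-2 in the geom_smooth calls: a factor is drawn on a discrete axis at integer positions 1, 2, 3, and so on, one per level. Because the ratings run from 3 to 9, subtracting 2 lines the numeric quality up with those positions so that the fitted line overlays the boxes. A small base R check of that alignment:

```r
# All rating values that occur in the data (3 to 9)
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)
quality.factor <- factor(quality)

# Integer codes of the factor levels: the positions used on a discrete axis
positions <- as.integer(quality.factor)

# TRUE: the positions equal quality - 2
all(positions == quality - 2L)
```

This only holds because every rating from 3 to 9 is present; with a gap in the ratings the offset would no longer be constant.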
scatter_plot("volatile.acidity") + 
  geom_smooth(aes(quality-2, volatile.acidity), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1)

cor(wine$quality, wine$volatile.acidity)
## [1] -0.194723

Volatile acidity: the trend suggests that wines with noticeable volatile acidity rarely receive good ratings.

scatter_plot("total.sulfur.dioxide") + 
  geom_smooth(aes(quality-2, total.sulfur.dioxide), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1)

cor(wine$quality, wine$total.sulfur.dioxide)
## [1] -0.1747372

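Leaving aside the low-rated wines, the lower the total sulfur dioxide concentration, the higher the rating tends to be. Most of the sulfur dioxide in wine is an additive used to protect red and white wines from oxidation rather than a natural component, and fewer additives make the wine more natural.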

scatter_plot("density") + 
  geom_smooth(aes(quality-2, density), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1)

cor(wine$quality, wine$density)
## [1] -0.3071233
scatter_plot("residual.sugar") + 
  geom_smooth(aes(quality-2, residual.sugar), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1)

cor(wine$quality, wine$residual.sugar)
## [1] -0.09757683

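This shows that density and residual sugar are both negatively correlated with quality: the lower they are, the higher the rating tends to be. The dataset documentation defines density as follows:

density: the density of water is close to that of water depending on the percent alcohol and sugar content

Alcohol, density, and residual sugar are therefore variables tightly linked by the underlying chemistry.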

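Anyone who knows the winemaking process, or simply remembers high school chemistry, knows that alcohol comes from sugar through fermentation, and that alcohol can in turn be oxidized to aldehyde and acid. So this result is not very surprising: when the sugar is fully converted the wine has a high alcohol concentration, and when it is not, less alcohol is produced. Residual sugar and alcohol are therefore in a trade-off relationship. The following checks whether this hypothesis holds.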

ggplot(aes(alcohol,residual.sugar), data = wine)+
  geom_smooth()
## `geom_smooth()` using method = 'gam'

cor(wine$alcohol, wine$residual.sugar)
## [1] -0.4506312

As expected, the two variables move in opposite directions, with a correlation coefficient of -0.4506312, a moderate negative correlation.
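
One way to put the size of this correlation in perspective is the squared coefficient, which for a simple linear fit gives the share of variance in one variable accounted for by the other:

```r
# Correlation between alcohol and residual.sugar reported above
r <- -0.4506312

# Share of variance explained by a simple linear relationship
r_squared <- r^2

round(r_squared, 3)
```

That is about 20% of the variance, so the trade-off is visible but far from deterministic.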

If alcohol is oxidized too far it turns into excess acid. Let us check whether this hypothesis holds.

a1 <- ggplot(aes(alcohol, fixed.acidity), data = wine)+
  geom_smooth()

a2<- ggplot(aes(x = alcohol, y = volatile.acidity), data = wine) +
  geom_smooth()  

a3 <- ggplot(aes(x = alcohol, y = citric.acid), data = wine) +
  geom_smooth() 
grid.arrange(a1, a2, a3)
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'

cor(wine$alcohol, wine$fixed.acidity)
## [1] -0.1208811
cor(wine$alcohol, wine$volatile.acidity)
## [1] 0.06771794
cor(wine$alcohol, wine$citric.acid)
## [1] -0.07572873

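  • Alcohol and citric.acid: -0.07572873
  • Alcohol and fixed.acidity: -0.1208811
  • Alcohol and volatile.acidity: 0.06771794

Alcohol shows a weak negative correlation with citric.acid and fixed.acidity overall, but the trend becomes clearly negative above an alcohol level of about 11. volatile.acidity is the opposite case: it starts to show a positive trend from an alcohol level of about 11. My explanation is that volatile acid is a by-product of the winemaking process. If acetic acid bacteria are present in the wine they convert alcohol into volatile acid, so the more alcohol there is, the more volatile acid can be produced. This is not a one-to-one conversion; rather, it takes only a small amount of alcohol turning into a small amount of volatile acid to spoil a wine.

In the multivariate stage we can examine the relationship between alcohol, volatile acidity, and rating.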
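
A trend that changes around an alcohol level of 11 can be checked numerically by computing the correlation separately on each side of the threshold. The helper cor_by_threshold and the toy vectors below are illustrative assumptions, not part of the original analysis or data.

```r
# Correlation of x and y computed separately below and above a threshold on x
cor_by_threshold <- function(x, y, threshold) {
  low <- x < threshold
  c(below = cor(x[low], y[low]),
    above = cor(x[!low], y[!low]))
}

# Illustrative toy data: flat below 11, rising above 11
alcohol <- c(9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5)
acid    <- c(0.30, 0.31, 0.30, 0.31, 0.30, 0.33, 0.36, 0.40)

res <- cor_by_threshold(alcohol, acid, threshold = 11)
res
```

On the real data the same call could be made with wine$alcohol and wine$volatile.acidity (or either of the other acids) to see whether the split correlations differ as the smoothed curves suggest.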

Let us look at cork taint.

scatter_plot("chlorides") +
  geom_smooth(aes(quality-2, chlorides), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1) + 
  scale_y_log10()

cor(wine$quality, wine$chlorides)
## [1] -0.2099344

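TCA (2,4,6-trichloroanisole) forms when fungi living in the cork come into contact with chlorides from an unsanitary winery environment or from disinfectant residues. If a winery uses corks that carry TCA, the wine will be contaminated to some degree as well. On this reading, chlorides are an indicator of overall contamination rather than of taste.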

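White wines from cool-climate regions generally have a pH between 3.0 and 3.2. Acidity has a protective effect because most microbes cannot survive in such a harsh environment. Below is the relationship between pH and rating.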

scatter_plot("pH") + 
  geom_smooth(aes(quality-2, pH), 
              data = wine, 
              method = "lm", 
              se = FALSE, 
              size = 1)

cor(wine$quality, wine$pH)
## [1] 0.09942725

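Wines with lower acidity (higher pH) tend to be rated slightly higher, but too many factors, namely all the different chemical compounds, influence pH.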

Next, if the acid concentration goes up, the pH should go down. This is high school chemistry, so let us verify it on the white wine data.

ggplot(aes(x = fixed.acidity, y = pH), data = wine) +
  #geom_point(alpha = 1/5) +
  xlim(quantile(wine$fixed.acidity, 0.01),
       quantile(wine$fixed.acidity, 0.99)) +
  ylim(quantile(wine$pH, 0.01),
       quantile(wine$pH, 0.99))+ 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 145 rows containing non-finite values (stat_smooth).

cor(wine$pH, wine$fixed.acidity)
## [1] -0.4258583

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

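The bivariate analysis shows that alcohol is the variable most correlated with quality, although the correlation is not very strong. Once ratings 3 and 4 are excluded, however, most variables show a much clearer positive or negative relationship with quality.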

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

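The documentation states: _density: the density of water is close to that of water depending on the percent alcohol and sugar content_. By definition, then, density is tied to both sugar and alcohol. The relationship between alcohol, sugar, and density can be explained with high school organic chemistry: the higher the alcohol content, the larger the share of sugar that has been fermented. Any bottle of white wine, or red wine for that matter, therefore tends to be either high in sugar and low in alcohol, or low in sugar and high in alcohol.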

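Interestingly, and contrary to my initial assumption that alcohol and the acids would be negatively correlated, this turned out not to be the case. These three acids may not be directly linked to sugar; perhaps they come from the grape itself or from the grape skins.

### What was the strongest relationship you found?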

  1. density is strongly positively correlated with residual sugar and total.sulfur.dioxide
  2. free.sulfur.dioxide is strongly positively correlated with total.sulfur.dioxide

Multivariate Plots Section

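The wines are divided into three groups:

  1. Inferior (ratings 3, 4)
  2. Middle (ratings 5, 6)
  3. Highquality (ratings 7, 8)

The plot above shows:

  1. Inferior wines have relatively high volatile acidity at every alcohol level.
  2. Middle wines follow a consistent trend across alcohol levels.
  3. Highquality wines with alcohol below 11 generally have very low volatile acidity.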

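At high volatile acidity, however, the data contradict what I believed earlier. Looking at the outliers, though, high-quality white wines have neither many nor extreme high volatile acidity values. Perhaps below a volatile acidity of 0.6 there is no noticeable acetic flavor.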
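
The plotting code in this section colors and facets by a quality.rating variable whose definition is not shown. A minimal sketch of how such a variable could be built with cut follows; the break points and labels are assumptions based on the three groups described above, and rating 9 (the maximum in the data) is assumed to belong to the top group.

```r
# All rating values that occur in the data
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# Assumed grouping: (2,4] Inferior, (4,6] Middle, (6,9] Highquality
quality.rating <- cut(
  quality,
  breaks = c(2, 4, 6, 9),
  labels = c("Inferior", "Middle", "Highquality")
)

table(quality.rating)
```

On the real data this would be assigned as a new column, for example wine$quality.rating built from wine$quality.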

grid.arrange(
  ggplot(wine, aes(density, fixed.acidity, color = quality.rating)) + 
  geom_point(alpha = .5) + 
   scale_colour_brewer(palette = "Reds") + 
   theme_dark() +
    xlim(quantile(wine$density, 0.01),
      quantile(wine$density, 0.99))+
    ylim(quantile(wine$fixed.acidity, 0.01),
      quantile(wine$fixed.acidity, 0.99))+
  geom_smooth(method = "lm", 
              se = FALSE, 
              size = 1)
  ,
ggplot(wine, aes(density, alcohol, color = quality.rating)) + 
  geom_point(alpha = .5) + 
   scale_colour_brewer(palette = "Reds") + 
   theme_dark() +
    xlim(quantile(wine$density, 0.01),
      quantile(wine$density, 0.99))+
    ylim(quantile(wine$alcohol, 0.01),
      quantile(wine$alcohol, 0.99))+
  geom_smooth(method = "lm",
              se = FALSE, 
              size = 1)
)
## Warning: Removed 167 rows containing non-finite values (stat_smooth).
## Warning: Removed 167 rows containing missing values (geom_point).
## Warning: Removed 157 rows containing non-finite values (stat_smooth).
## Warning: Removed 157 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_smooth).
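
The "Removed ... rows" warnings come from the xlim and ylim calls: limits set at the 1% and 99% quantiles drop every point outside them before the smoother is fitted. The sketch below counts how many values fall outside such limits, using simulated data as a stand-in for a wine column.

```r
set.seed(1)
# Simulated stand-in for one numeric column (NOT the real wine data)
x <- rnorm(1000)

# Same kind of limits as xlim(quantile(x, 0.01), quantile(x, 0.99))
limits <- quantile(x, c(0.01, 0.99))

# Values outside the limits are the ones that would be removed
outside <- sum(x < limits[1] | x > limits[2])
outside
```

If the intent is only to zoom in, ggplot2's coord_cartesian(xlim = ..., ylim = ...) restricts the view without discarding rows, so the fitted lines would then use all the data.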

ggplot(wine, aes(chlorides, sulphates, color = quality.factor)) + 
  geom_point(alpha = .5) + 
   scale_colour_brewer(palette = "Blues") + 
   theme_dark() +
  xlim(0, quantile(wine$chlorides, 0.95)) + 
  ylim(0.25, quantile(wine$sulphates, 0.95)) + 
  facet_wrap(~ quality.rating)
## Warning: Removed 463 rows containing missing values (geom_point).

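No clear chemical interaction shows up between chlorides and sulphates. Chlorides readily exist as chloride ions (Cl-) and sulphates as sulfate ions; both are anions, so they would indeed not be expected to react with each other.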

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

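The plots involving physical density (density against fixed acidity, density against alcohol) show that physics cannot be violated. Wines with high alcohol are usually rated higher and have correspondingly lower density, so many of the high-quality wines in the fixed acidity plot sit at low density. This is really the same basic idea restated in another way.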

Were there any interesting or surprising interactions between features?

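The relationship between volatile acidity and quality. I originally expected that too high a volatile acidity would give an acetic taste that masks the other flavors and thus lowers the rating. The data, however, show that high-quality wines mostly fall in the same volatile acidity range as middle and inferior wines, but their outliers are clearly fewer and less extreme.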


Final Plots and Summary

Plot One

## [1] 0.4355747

Description One

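Alcohol is the factor with the greatest influence on rating, with a correlation coefficient of 0.4355. On the full data this is only a weak correlation, but once the inferior ratings are excluded alcohol shows a clearly positive trend with rating. For the inferior wines, other factors evidently spoil the result and keep the main feature (alcohol) from standing out.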

Plot Two

## [1] -0.194723

Description Two

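The higher the volatile acidity, the lower the rating. The reason is that wines with a high level of volatile acid usually have a strong, pungent smell that masks the wine's other aromas. The aftertaste also carries a strong, burning acetic note, and sometimes there is a smell reminiscent of resin or nail polish remover.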

The data and the professional palate agree.

Plot Three

## [1] -0.2099344

Description Three

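The last plot covers a common problem in red and white wines: cork taint. TCA forms when fungi living in the cork come into contact with chlorides from an unsanitary winery environment or from disinfectant residues, and the wine is then contaminated to some degree as well.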

Reflection

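The white wine dataset contains 4,898 wine samples. To explore the chemical composition of white wine, I started with single variables and studied whether their distributions are Gaussian, then checked which variable is most correlated with the main target (quality). Starting from that highly correlated variable, alcohol, I looked at its relationships with the other variables to find the next most relevant ones and to see whether they influence each other.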

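From the correlation coefficients I picked the three most positively and the three most negatively correlated variables, and used a physical argument to explain the relationship between density, the most negatively correlated variable, and alcohol, the most positively correlated one. From wine knowledge I also learned the practical reasons why the variables associated with poor quality (volatile acidity, chlorides) matter (taste and cork taint).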

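Finally, I explored the most positively and negatively correlated variables and their relationships with rating. In the future I will use machine learning techniques such as Random Forest or entropy-based methods to find the statistically real causes and to predict white wine ratings.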